The Credit Scoring Model (Risk Analytics) project delivers an end-to-end machine learning solution for assessing borrower risk in loan approvals, enabling automated, accurate decisions. It processes applicant data with Pandas (handling class imbalance, outliers, and missing values), trains a LightGBM model to predict default probability, applies feature selection, and calibrates scores with Platt scaling, tracking experiments in MLflow for reproducibility. The system achieves 0.89 AUC-ROC, reduces default rates by roughly 30%, stays well calibrated (Brier score < 0.05), and meets regulatory requirements; the project ran 8.5 months, from March to November 2025, delivering scalable financial risk assessment.
The architecture follows a single pipeline: data is loaded and preprocessed with Pandas to handle imperfections, enriched through feature engineering and selection, used to train a LightGBM binary classifier (default probability), calibrated via Platt scaling for reliable scores, and managed through MLflow for experiment logging, artifacts, and lifecycle stages. This design favors efficiency on large datasets, interpretability for risk analytics, and straightforward integration with loan systems, with an emphasis on probabilistic outputs, cross-validation, and reproducible workflows for financial compliance.
The system uses Python for scripting and integration, LightGBM for gradient-boosted modeling, Pandas for data manipulation and preprocessing, and MLflow for experiment tracking, metric logging, and the model registry. Scikit-learn supplies KNN imputation, RFECV feature selection, Platt calibration, and evaluation metrics; imbalanced-learn provides SMOTE resampling; supporting tools cover hyperparameter tuning and versioning.
The risk model uses LightGBM for efficient binary classification of loan defaults, trained with a 0.05 learning rate, early stopping, and AUC as the evaluation metric on a stratified 80/20 split. Features such as income and credit history are selected via LightGBM importance scores and RFECV; preprocessing applies SMOTE for class imbalance, IQR capping for outliers, and KNN or median imputation for missing values. Platt scaling (sigmoid calibration) then adjusts the raw probabilities, yielding 0.89 AUC-ROC and a 0.04 Brier score, with both global and local interpretability.
Data processing loads inputs (e.g., CSV) with Pandas, handles missing values (KNN imputation), outliers (IQR capping or removal), and class imbalance (SMOTE resampling), engineers features, and selects them via RFECV. Models are trained and calibrated, predictions are logged in MLflow, and artifacts (parameters, metrics) are stored for reproducibility; the pipeline enforces data quality, anonymizes records for privacy, and processes 100k+ samples in under 5 minutes.
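A minimal sketch of the cleaning steps (IQR capping, then KNN imputation) on a toy frame; the column names are illustrative, and SMOTE resampling via imbalanced-learn would follow on the cleaned features:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy applicant frame with missing values and an extreme income outlier.
df = pd.DataFrame({
    "income": [42_000, 55_000, np.nan, 61_000, 1_000_000],
    "credit_history_years": [3, 7, 5, np.nan, 10],
})

# IQR capping: clip each column to [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = df.quantile(0.25), df.quantile(0.75)
iqr = q3 - q1
df = df.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr, axis=1)

# KNN imputation of the remaining missing values.
imputed = pd.DataFrame(KNNImputer(n_neighbors=2).fit_transform(df),
                       columns=df.columns)
```

Capping before imputation matters here: otherwise the 1,000,000 outlier would distort the nearest-neighbor distances used to fill the gaps.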
Testing covers unit validation of the preprocessing and calibration functions, integration checks of the pipeline flow, performance tuning against targets of AUC-ROC > 0.85 and Brier < 0.05, and bias checks via stratified sampling. Deployment registers models in MLflow (promoted from Staging to Production), integrates with loan systems for live predictions, uses a phased rollout with anonymization, and supports rollback to earlier model versions if issues arise.
Post-deployment, model performance and drift are monitored via MLflow metric tracking, periodic retraining on new data, and calibration checks, targeting >99% uptime and a stable AUC. Maintenance includes quarterly updates to features and calibration, monthly compliance and bias audits, and cost controls, with alerts on high-risk patterns triggering manual review.
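Score-drift checks of this kind are commonly implemented with the Population Stability Index (PSI) over the score distribution; the project does not name its drift metric, so this NumPy sketch shows one conventional choice (the usual rule of thumb flags PSI > 0.25 as significant drift):

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between baseline and current scores."""
    # Bin edges from the baseline's quantiles, so bins are equally populated.
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    # Clip both samples into the baseline range so every value lands in a bin.
    expected = np.clip(expected, edges[0], edges[-1])
    actual = np.clip(actual, edges[0], edges[-1])
    e = np.histogram(expected, edges)[0] / len(expected)
    a = np.histogram(actual, edges)[0] / len(actual)
    # Avoid log(0) for empty bins.
    e, a = np.clip(e, 1e-6, None), np.clip(a, 1e-6, None)
    return float(np.sum((a - e) * np.log(a / e)))

rng = np.random.default_rng(0)
baseline = rng.normal(0.0, 1.0, 2000)   # scores at deployment time
stable = rng.normal(0.0, 1.0, 2000)     # same population later
shifted = rng.normal(1.0, 1.0, 2000)    # drifted population
```

A scheduled job computing `psi(baseline, current_scores)` and alerting above the threshold would implement the review trigger described above.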